(19) 



J 



Europaisches Patentamt 
European Patent Office 
Office europeen des brevets 



(12) 



(43) Date of publication: 

30.1 0.2002 Bu lletin 2002/44 



(H) EP 1 253 581 A1 

EUROPEAN PATENT APPLICATION 

(51) lntC!7: G10L 21/02 



00 

lf> 

CO 

in 

CM 



CL 
UJ 



(21) Application number: 01201551.7 

(22) Date of filing: 27.04.2001 



(84) Designated Contracting States: 


(72) 


Inventor: Vetter, Rolf 


AT BE CH CY DE DK ES Fl FR GB GR IE IT LI LU 




1400 Yverdon (CH) 


MC NL PT SE TR 




Designated Extension States: 


(74) 


Representative: Laurent, Jean et al 


AL LT LV MK RO SI 




IC B 






Ingenieurs Conseils en Brevets SA 


(71) Applicant: CSEM 




Rue des Sors 7 


Centre Suisse cf Elect ronique et de 




CH-2074 Marin (CH) 


Microtechnique S.A. 




2007 Neuchatel (CH) 







(54) Method and system for enhancing speech in a noisy environment 



(57) There is described a method and system for en- 
hancing speech in a noisy environment. The method op- 
erates on a frame-to-frame basis and preferably uses a 
Discrete Cosine Transform (DCT) to transform time-do- 
main components of an input signal into frequency-do- 
main components. The speech enhancement method is 
essentially based on a subspace approach in the so- 
called Bark-domain and an optimal subspace selection 
using a Minimum Description Length (MDL) criterion. 

The MDL-based subspace selection leads to a par- 
tition of the multi - dimensional space of noisy data into 
a noise subspace, a signal subspace and a signal-plus- 
noise subspace. The enhanced signal is reconstructed 
by applying the inverse transform to the components of 
the signal subspace and weighted components of the 
signal-plus-noise subspace, the noise subspace being 
nulled during this reconstruction. 

The resulting enhancement method provides max- 
imum noise reduction while minimizing signal distortions 
such as the so-called musical residual noise encoun- 
tered with conventional subtract ive-type enhancement 
methods. 
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Description 

[0001] This invention is in the field of signal processing and is more specifically directed to noise suppression (or, 
conversely, signal enhancement) in the telecommunication of human speech. 
5 [0002] Speech enhancement is often necessary to reduce listener's fatigue or to improve the performance of auto- 
matic speech processing systems. A major class of noise suppression techniques is referred to in the art as spectral 
subtraction. Spectral subtraction, in general, considers the transmitted noisy signal as the sum of the desired speech 
signal with a noise component. 

[0003] A typical approach consists in estimating the spectrum of the noise component and then subtracting this 
10 estimated noise spectrum, in the frequency domain, from the transmitted noisy signal to yield the remaining desired 
speech signal. 

[0004] Subtractive type techniques are typically based on the Discrete Fourier Transform <DFT) and constitute a 
traditional approach for removing stationary background noise in single channel systems. A major problem however 
with most of these methods is that they suffer from a distortion called "musical residual noise". 

15 [0005] To reduce this distortion, a prior art method has been proposed which utilizes the simultaneous masking effect 
of the human ear. It has been observed that the human ear ignores, or at least tolerates, additive noise so long as its 
amplitude remains below a masking threshold in each of multiple critical frequency bands within the human ear. As is 
well known in the art, a critical band is a band of frequencies that are equally perceived by the human ear. N. Virag, 
"Single Channel Speech Enhancement Based on Masking Properties of the Human Auditory System", IEEE Transac- 

20 tions on Speech and Audio Processing, Vol. 7, No. 2 (March 1999), pp. 126-137, describes a technique in which 
masking thresholds are defined for each critical band, and are used in optimizing spectral subtraction to account for 
the extent to which noise is masked during speech intervals. 

[0006] Improvements have also been achieved by using eigenspace approaches based on Karhunen-Loeve Trans- 
form (KLT). Y. Ephraim et al., "A Signal Subspace Approach for Speech Enhancement", IEEE Transactions on Speech 

25 and Audio Processing, Vol. 3, No. 4 (July 1995), pp. 251-266, describes a subspace approach based on KLT. The 
underlying principle of this subspace approach is to observe the data in a large dimensional space of delayed coordi- 
nates. Since noise is assumed to be random, it extends approximately in uniform manner in all the directions of this 
space, while in contrast, the dynamics of the deterministic system underlying the speech signal confine the trajectories 
of the useful signal to a lower-dimensional subspace. Consequently, the eigenspace of the noisy signal is partitioned 

30 into a noise subspace and signal-plus-noise subspace. Enhancement is obtained by removing the noise subspace and 
optimally weighting the signal-plus-noise subspace. 

[0007] Notably, it has been shown that highest performance is obtained when using KLT with an associated subspace 
selection using the Minimum Description Length (MDL) criterion. Vetter et al., "Single Channel Speech Enhancement 
Using Principal Component Analysis and MDL Subspace Selection", in Proceedings of the 6 th European Conference 
35 on Speech Communication and Technology (Eurospeech'99), Budapest, Hungary (September 5-9, 1999), vol. 5, pp. 
241 1 -241 4, which is incorporated herein by reference, describes a subspace approach for single channel speech en- 
hancement and speech recognition in highly noisy environments which is based on Principal Component Analysis 
(PCA). According to this particular approach, in order to maximize noise reduction and minimize signal distortion, the 
eigenspace of the noisy data is partitioned into three different subspaces : 

40 

i) a noise subspace which contains mainly noise contributions. These components are nulled during reconstruction; 

ii) a signal subspace containing components with high signal-to-noise ratios (SNRj » 1 ). Components of this sub- 
space are not weighted since they contain mainly components from the original signal. This allows a minimization 
of the signal distortion; and 

45 iii) a signal-plus-noise subspace which includes the components with SNRj~ 1 . The estimation of the dimension 

of this subspace can only be done with a high error probability. Consequently, principal components with SNRj < 
1 may belong to it and a weighting is applied during reconstruction. 

[0008] The general enhancement scheme of this prior art approach is represented in Figure 1 . A detailed description 
so of this enhancement scheme is described in the above-mentioned Vetter et al. reference. 

[0009] The above-cited KLT-based subspace approaches are however not appropriate for real time implementation 
since the eigenvectors or eigenfilters have to be computed during each frame, which implies high computational re- 
quirements. 

[0010] It is thus a principal object of the present invention to provide a method and a system for enhancing speech 
55 in a noisy environment which yields the robustness and efficiency of the KLT-based subspace approaches. 

[0011] It is a further object of the present invention to provide a method and a system for enhancing speech which 
implies low computational requirements and thus allows this method to be implemented and this system to be used 
for real time speech enhancement in real world conditions. 
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[0012] Accordingly, there is provided a method for enhancing speech in a noisy environment the features of which 
are cited in claim 1 . 

[0013] There is also provided a system for enhancing speech in a noisy environment the features of which are cited 
in claim 13. 

5 [0014] Other advantageous embodiments of the invention are the object of the dependent claims. 

[0015] According to the present invention, in order to circumvent the above-mentioned drawback of the KLT-based 
subspace approaches, i.e. the high computational requirements, one uses prior knowledge about perceptual properties 
of the human auditory system. In particular, according to the present invention, one substitutes the eigenfilters in the 
KLT approach by the so-called Bark filters. 

10 [0016] According to a preferred embodiment of the present invention, this Bark filtering is processed in the DCT 
domain, i.e. a Discrete Cosine Transform is performed. It has been shown that DCT provides significantly higher energy 
compaction as compared to the DFT which is conventionally used. In fact its performance is very close to the optimum 
KLT, It will however be appreciated that DFT is equally applicable despite yielding lower performance. 
[0017] The method according to the present invention provides similar performance in terms of robustness and ef- 

15 ficiency with respect to the KLT-based subspace approaches of Ephraim et al. and Vetter et al. In contrast to these 
prior art enhancing methods, the computational load of the method according to the present invention is however 
reduced by an order of magnitude and thus promotes this method as a promising solution for real time speech en- 
hancement. 

[0018] Other aspects, features and advantages of the present invention will be apparent upon reading the following 
20 detailed description of non-limiting examples and embodiments made with reference to the accompanying drawings, 
in which : 

Figure 1 schematically illustrates a prior art speech enhancing scheme based on Karhunen-Loeve Transform KLT, 
or Principal Component Analysis, with an associated Minimum Description Length (MDL) criterion; 
25 - Figure 2 is a block diagram of a single channel speech enhancement system for implementing a first embodiment 
of the method according to the present invention; 

Figure 3 is a flow chart generally illustrating the speech enhancement method of the present invention; 
Figure 4 schematically illustrates a preferred embodiment of a single channel speech enhancing scheme according 
to the present invention based on a Discrete Cosine Transform (DCT); 
30 - Figure 5 illustrate a typical genetic algorithm (GA) cycle which may be used for optimizing the parameters of the 
speech enhancement method of the present invention; 

Figure 6a to 6d are speech spectrograms illustrating the efficiency of the speech enhancing method of the present 
invention, in particular as compared to classical subtract ive-type enhancing scheme using DFT such as non-linear 
spectral subtraction (NSS); 

35 - Figure 6e illustrate the signal and signal-plus-noise subspace dimensions (p, and p£ estimated using the method 
of the present invention; 

Figure 7 is a block diagram of a dual channel speech enhancement system for implementing a second embodiment 
of the method according to the present invention; and 

Figure 8 schematically illustrates a preferred embodiment of a dual channel speech enhancing scheme according 
40 to the present invention based on DCT 

[0019] Figure 2 schematically shows a single channel speech enhancement system for implementing the speech 
enhancement scheme according to the present invention. This system basically comprises a microphone 10 with as- 
sociated amplifying means 11 for detecting the input noisy signals, a filter 12 connected to the microphone 10, and an 
45 analog-to-digital converter (ADC) 14 for sampling and converting the received signal into digital form. The output of 
the ADC 14 is applied to a digital signal processor (DSP) 16 programmed to process the signals according to the 
invention which will be described hereinbelow. The enhanced signals produced at the output of the DSP 1 6 are supplied 
to an end-user system 18 such as an automatic speech processing system. 

[0020] The DSP 1 6 is programmed to perform noise suppression upon received speech and audio input from micro- 
so phone 10. Figure 3 schematically shows the sequence of operations performed by DSP 16 in suppressing noise and 
enhancing speech in the input signal according to a preferred embodiment of the invention which will now be described. 
[0021] As illustrated in Figure 3, the input signal is firstly subdivided into a plurality of frames each comprising N 
samples by typically applying Hanning windowing with a certain overlap percentage. It will thus be appreciated that 
the method according to the present invention operates on a frame-to-frame basis. After this windowing process, in- 
55 dicated 100 in Figure 3, a transform is applied to these N samples, as indicated by step 110, to produce N frequency- 
domain components indicated X(k). 

[0022] These frequency-domain components X(k) are then filtered at step 120 by so-called Bark filters to produce 
A/ Bark components, indicated X(k) Bario for each frame and are then subjected to a subspace selection process 130, 
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which will be described hereinbelow in greater details, to partition the noisy data into three different subspaces, namely 
a noise subspace, a signal subspace and a signal-plus-noise subspace. 

[0023] The enhanced signal is obtained by applying the inverse transform (step 150) to components of the signal 
subspace and weighted components of the signal-plus-noise subspace, the noise subspace being nulled during re- 
construction (step 140). 

[0024] The global framework for the subspace approach according to the present invention is described hereinbelow 
in greater details. In the context of the present invention, one considers the problem of additive noise, which implies 
that the observed noisy signal x(t) is given by : 

x(t)=s{t) + n(t) r = o,..., A/,-1 (1) 

where s(t) is the speech signal of interest, n(i) is a zero mean, additive stationary background noise, and N t is the 
number of observed samples. 

[0025] In a general way, as already mentioned, the basic idea in subspace approaches can be formulated as follows : 
the noisy data is observed in a large m-dimensional space of a given dual domain (for example the eigenspace com- 
puted by KLT as described in Y. Ephraim et al., "A Signal Subspace Approach for Speech Enhancement", cited here- 
inabove). If the noise is random and white, it extends approximately in a uniform manner in all directions of this dual 
domain , while, in contrast, the dynamics of the deterministic system underlying the speech signal confine the trajectories 
of the useful signal to a lower-dimensional subspace of dimension p < m. As a consequence, the eigenspace of the 
noisy signal is partitioned into a noise subspace and a signal-plus-noise subspace. Enhancement is obtained by nulling 
the noise subspace and optimally weighting the signal-plus-noise subspace. 

[0026] The optimal design of such a subspace algorithm is a difficult task. The subspace dimension p should be 
chosen during each frame in an optimal manner through an appropriate selection rule. Furthermore : the weighting of 
the signal-plus-noise subspace introduces a considerable amount of speech distortion. 

[0027] As already mentioned, in order to simultaneously maximize noise reduction and minimize signal distortion, 
there has already been proposed in Vetter et al., "Single Channel Speech Enhancement Using Principal Component 
Analysis and MDL Subspace Selection" (already cited hereinabove and incorporated herein by reference) a promising 
approach consisting in a partition of the eigenspace of the noisy data into three different subspaces, namely : 

i) a noise subspace of dimension m - p 2 which contains mainly noise contributions. These components are nulled 
during reconstruction; 

ii) a signal subspace of dimension p 1 containing components with high signal-to-noise ratios (S/v7? ; » 1). Compo- 
nents of this subspace are not weighted since they contain mainly components from the original signal. This allows 
a minimization of the signal distortion; and 

iii) a signal-plus-noise subspace of dimension p^ - p 1 which includes the components with SNRj= 1 . The estimation 
of the dimension of this subspace can only be done with a high error probability. Consequently, principal compo- 
nents with SNRj < 1 may belong to it and a weighting is applied during reconstruction. 

[0028] A similar approach is used according to the present invention (step 130 in Figure 3) to partition the space of 
noisy data. In classical subspace approaches, components of the dual domain are obtained by applying the eigenvec- 
tors or eigenfilters computed by KLT on the delay embedded noisy data. To avoid the large computational means 
required for these operations, it is proposed, according to the present invention, to use masking properties of the human 
auditory system in order to substitute the eigenfilters of the classical subspace approaches by the so-called Bark filters. 
[0029] Noise masking is a well known feature of the human auditory system. It denotes the fact that the auditory 
system is incapable to distinguish two signals close in the time or frequency domains. This is manifested by an elevation 
of the minimum threshold of audibility due to a masker signal, which has motivated its use in the enhancement process 
to mask the residual noise and/or signal distortion. The most applied property of the human ear is simultaneous mask- 
ing. It denotes the fact that the perception of a signal at a particular frequency by the auditory system is influenced by 
the energy of a perturbing signal in a critical band around this frequency. Furthermore, the bandwidth of a critical band 
varies with frequency, beginning at about 1 00 Hz for frequencies below 1 kHz, and increasing up to 1 kHz for frequencies 
above 4 kHz. 

[0030] From the signal processing point of view the simultaneous masking is implemented by a critical filterbank, 
the so-called Bark filterbank, which gives equal weight to portions of speech with the same perceptual importance. 
According to the invention, the prior knowledge about the human auditory system is used to replace the eigenfilters in 
the KLT approach by Bark filtering. 

[0031] Furthermore, in order to have a maximum energy compaction the filtering is preferably processed in the Dis- 
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crete Cosine Transform (DCT) domain. Indeed, DCT outperforms DFT in terms of energy compaction and its perform- 
ance is very close to the optimum KLT. Again, it will be appreciated that DFT is equally applicable to perform this filtering 
despite being less optimal than DCT. 

[0032] Since Bark filtering is based on energy considerations, this filtering is based on the square of the DCT com-- 
5 ponents. Bark components are thus defined by the following expression : 



6/2 

X{k) Bark = J^GO^iXik-j)} 2 * = 0....,tf-l (2) 
10 j=-b/2 

where b + 1 is the processing-width of the filter, G(j t k) is the Bark filter whose bandwidth depends on k t and X(k) are 
the DCT components defined as : 

15 
20 

where a(0) = VTT/vand a(A) = j2/Mor k* 0. At this point it is important to note that by computing dual domain com- 
ponents as given by expression (2), one obtains a dual domain of dimension m= N. 

[0033] A crucial point in the proposed algorithm is the adequate choice of the dimensions of the signal-plus-noise 
subspace (p 2 ) and signal subspace (p.,). It requires the use of a truncation criterion applicable for short time series. 

25 Among the possible selection criteria, the Minimum Description Length (MDL) criterion has been shown in multiple 
domains to be a consistent model order estimator, especially for short time series. This high reliability and robustness 
of the MDL criterion constitutes the primer motivation for its use in the method of the present invention. To achieve this 
task, it is assumed that the Bark components given by expression (2) above rearranged in decreasing order constitute 
a liable approximation of the principle components of speech. Under this assumption, the following expression is ob- 

30 tained for the MDL in the case of additive white Gaussian noise as described in Vetter et al. cited hereinabove : 
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where /= 1 , 2, M=PfN-p?/2 +p/2 + 1 is the number of free parameters and Xyfor y'= 0,...,A/- 1 are the Bark components 
given by expression (2) rearranged in decreasing order. The parameter y determines the selectivity of MDL. Accordingly, 
the dimensions p 1 and P2 are given by the minimum of MDL(pj) with y = 64 and y = 1 respectively. This choice of y 
involves that the parameter p 1 provides a very parsimonious representation of the signal whereas selects also 
components with signal-to-noise ratios SNRj~ 1 . 

[0034] An important feature of the method according to the present invention resides in the fact that frames without 
any speech activity lead to a null signal subspace. This feature thus yields a very reliable speech/noise detector. This 
information is used in the present invention to update the Bark spectrum and the variance of noise during frames 
without any speech activity, which ensures eventually an optimal signal prewhitening and weighting. Notably, it has to 
be pointed out that the prewhitening of the signal is important since MDL assumes white Gaussian noise. 
[0035] Figure 4 schematically illustrates the proposed enhancement method according to a preferred embodiment 
of the present invention. As illustrated, following a windowing process 200, the time-domain components of the noisy 
signal x(t) are transformed in the frequency-domain (step 210) using DCT to produce frequency-domain components 
indicated X(k). These components are processed using Bark filters (step 220) as described hereinabove to produce 
Bark components as defined in expression (2). These Bark components are subjected to a prewhitening process 230 
to produce components complying with the assumption made for the subsequent subspace selection process 240 
using MDL, namely the fact that MDL assumes white Gaussian noise. The prewhitening process 230 may typically be 
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realized using a so-called whitening filter as described in "Statistical Digital Signal Processing and Modeling", Monson 
H. Hayes, Georgia Institute of Technology, John Wiley & Sons (1996), § 3.5, pp. 104-106. 

[0036] As already described, the MDL-based subspace selection process 240 leads to a partition of the noisy data 
into a noise subspace of dimension N - p^, a signal subspace of dimension p t and a signal-plus-noise subspace of 
5 dimension p^'Pv Tnis process also provides indication of frames without any speech activity since the signal subspace 
is null in that case, i.e. p 1 = p 2 = 0. Speech/noise detection is thus provided at step 280. 

[0037] The enhanced signal is obtained by applying the inverse DCT to components of the signal subspace and 
weighted components of the signal-plus-noise subspace (steps 250 and 260 in Figure 4) followed by overlap/add 
processing (step 300) since Hanning windowing was initially performed at step 200. Using the definition of inverse 
w OCT it can be written as : 



15 



20 



25 



30 



fP2 
%•(')*/,.+ X gj a } (t)X If (5) 



with 



a^) = cos V Vkt ' \ (6) 



where Xjiorj= 1 ,...,A/are the Bark components given by expression (2) rearranged in decreasing order, /yis the index 
of rearrangement and Cy is an appropriate weighting function. 

[0038] This weighting function p y may for instance result of a time autoregressive moving average domain masking 
of the form 



K lagb 

gj(k) = K agj (k - 1) ^ K bi gj{k - 0 (7) 

35 * = 0 

where the non-filtered weighting function has been chosen as follows : 

40 gj = exp{-vj/SNRj\ j = A + l,".»/>2 (8) 

where SA/fl y fory = 0,...,/V- 1 is the estimated local signal-to-noise ratio of each Bark component and the parameter v 
is adjusted through a non-linear probabilistic operator in function of the global signal-to-noise ratio SNR as follows : 

45 



50 

where 

55 



MSNR) if 

f 2 (SNR) if j < p 2 (9) 
MSNR) if p 2 <j <N 



fi = K i\ + *a l°gsig{ k, 3 + K i4 SNR } 



(10) 
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and 



SNR = medim(SNR(k), . . . , SNR(k - lag K )) (1 1 ) 

5 

and SNR(k) is the estimated global logarithmic signal-to-noise ratio. 

[0039] Referring again to Figure 4, it will be seen that the global and local signal-to-noise ratios are estimated at 
steps 270 and 275 respectively for adjusting the above-defined weighting function. Furthermore, these estimations are 
w updated during frames with no speech activity (step 280). 

[0040] In order to obtain highest perceptual performance one may additionally tolerate background noise of a given 
level and use a noise compensation (step 290) of the form: 

?(/) = v 4 i(/) + (l-v 4 )x(0 (12) 



where 

20 

v 4 =f 4 (SNR) (13) 



and / 4 is given by expression (10). 
25 [0041] The above reconstruction scheme contains a large number of unknown parameters, namely: 

(14) 

30 

[0042] This parameter set should be optimised to obtain highest performance. To this effect so-called genetic algo- 
rithms (GA) are preferably applied for the estimation of the optimal parameter set. 

[0043] Genetic algorithms, or GAs, have recently attracted growing interest from the signal processing community 
for the resolution of optimization problems in various application. One may for instance reference to H. Holland, "Ad- 

35 aptation in natural and artificial systems", the University of Michigan Press, Ml, USA (1975), K.S. Tang et al., "Genetic 
algorithms and their applications", IEEE Signal Processing Magazine, vol. 13, no. 6 (November 1996), pp. 22-37, R. 
Vetter et al., "Observer of the human cardiac sympathetic nerve activity using blind source separation and genetic 
algorithm optimization", in the 1 9 th Annual International Conference of the IEEE Engineering in Medicine and Biological 
Society (EMBS), Chicago (1997), pp. 293-296 or R. Vetter, "Extraction of efficient and characteristics features of mul- 

40 tidimensional time series", PhD thesis, EPFL, Lausanne (1999). 

[0044] GAs are search algorithms which are based on the laws of natural selection and evolution of a population. 
They belong to a class of robust optimization techniques that do not require particular constraint, such as for example 
continuity, differentiability and uni-modality of the search space. In this sense, one can oppose GAs to traditional, 
calculus-based optimization techniques which employ gradient-directed optimization. GAs are therefore well suited for 

45 ill-defined problems as the problem of parameter optimization of the speech enhancement method according to the 
present invention. 

[0045] The general structure of a GA is illustrated in Figure 5. A GA operates on a population which comprises a set 
of chromosomes. These chromosomes constitute candidates for the solution of a problem. The evolution of the chro- 
mosomes from current generations (parents) to new generations (offspring) is guided in a simple GA by three funda- 

50 mental operations : selection, genetic operations and replacement. 

[0046] The selection of parents emulates a "survival -of-the-fittest" mechanism in nature. A fitter parent creates 
through reproduction a larger offspring and the chances of survival of the respective chromosomes are increased. 
During reproduction chromosomes can be modified through mutation and crossover operations. Mutation introduces 
random variations into the chromosomes, which provides slightly different features in its offspring. In contrast, crossover 

55 combines subparts of two parent chromosomes and produces offspring that contain some parts of both parent's genetic 
material. Due to the selection process, the performance of the fittest member of the population improves from generation 
to generation until some optimum is reached. Nevertheless, due to the randomness of the genetic operations, it is 
generally difficult to evaluate the convergence behaviour of GAs. Particularly, the convergence rate of GA is strongly 
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influenced by the applied parameter encoding scheme as discussed in C.Z. Janikow et al., "An experimental compar- 
ison of binary and floating point representation in genetic algorithms", in Proceedings of the 4 th International Conference 
on Genetic Algorithms (1 991 ), pp. 31 -36. In classical GAs, parameters are often encoded by binary numbers. However, 
it has been shown in C.Z. Janikow et al. that the convergence of GAs can be improved through floating point repre- 

5 sentation of chromosomes. 

[0047] In the problem at hand, the aim is at estimating the parameters of the proposed speech enhancement method 
to obtain highest performance. The population consists therefore of chromosomes c h i = 1, ...,!_, each one containing 
a set of encoded parameters k of a candidate method. The range of values of these parameters is bounded due to the 
nature of the problem at hand. Th\z, in fact, imposes a bounded searching space, which is a necessary condition for 

io global convergence of GAs. In the optimization problem at hand order to achieve the evolution of the population is 
guided by a specific GA particularly adapted for small populations. 

[0048] This algorithm was first introduced by D.E. Goldberg in "Genetic algorithm in search, optimization, and ma- 
chine learning", Addison Wesley Reading, USA (1 989) and has been shown to provide high performance in numerous 
applications. The algorithm can be summarized as follows : 

75 

Generate randomly an initial population P(O) = [c^.cj, with L an odd integer; 
Compute the fitness F of each chromosomes in the current population; 
Create new chromosomes by applying one of the following operations : 

20 - Elitist strategy : the chromosome with the best fitness goes unchanged into the next generation; 

Mutation : (L-1 )/2 mutations from the fittest chromosome are passed to the next generation. (L-1 )/4 chromosomes 
are created by adding Gaussian noise with a variance a-, to a randomly selected parameter of the fittest chromo- 
some and the same operation with variance o 2 « o-j is performed for the remaining (L-1)/4 chromosomes; 
Crossover: Each chromosome competes with its neighbour. The losers are discarded whereas the winners are 

25 put in a mating pool. From this pool, (L-1)/2 chromosomes are created by crossover operations for the next gen- 

eration; 

Iterate the scheme until convergence is achieved. 

[0049] The central elements in the proposed GA are the elitist survival strategy, Gaussian mutation in a bounded 
30 parameter space, generation of two subpopulations and the fitness functions. The elitist strategy ensures the survival 
of the fittest chromosome. This implies that the parameters with the highest perceptual performance are always prop- 
agated unchanged to the next generation. The bounded parameter space is imposed by the problem at hand and 
together with Gaussian mutation it guarantees that the probability of convergence of the parameters to the optimal 
solution is equal to one for an infinite number of generations. The convergence properties are improved by the gener- 
35 ation of two subpopulations with various random influences a v o 2 . Since o 2 « a 1f the population generated by o 2 
ensures a fast local convergence of the GA. In contrast, the population generated by o A covers the whole parameter 
space and enables the GA to jump out of local minima and converge to the global minimum. 

[0050] A very important element of the GA is the fitness function F, which constitutes an objective measure of the 
performance of the candidates. In the context of speech enhancement, this function should assess the perceptual 
40 performance of a particular set of parameters. Thus, the speech intelligibility index (SI I) as defined by the American 
National Standard ANSI S3.5-1997 has been applied. Eventually, GA optimization has been performed on a database 
consisting of French sentences. 

[0051] With respect to the performance of the speech enhancing method of the present invention, it has been ob- 
served by the authors that subspace approaches generally outperform linear and non-linear subtractive-type methods 
45 using DFT. In particular, subspace approaches yield a considerable reduction of the so-called "musical noise". In a 
qualitative way, this observation has been confirmed by informal listening tests but also through inspections of the 
spectrograms shown in Figures 6a to 6e. 

[0052] Figure 6a schematically shows the speech spectrogram of the original speech signal corresponding to the 
French sentence "Un loup s'est jete immediatement sur la petite chevre". Figure 6b schematically shows the noisy 

so signal (non -stationary factory noise at a segmental input SNR = 10 dB). Figure 6c illustrates the enhanced signal 
obtained using a non-linear spectral subtraction (NSS) using DFT as described in P. Lockwood "Experiments with a 
Nonlinear Spectral Subtractor (NSS), Hidden Markov Models and Projection, for Robust Recognition in Cars", Speech 
Communications (June 1 992), vol. 1 1 , pp. 21 5-228. Figure 6d shows the enhanced signal obtained using the enhancing 
scheme of the present invention and Figure 6e shows the signal and signal-plus-noise subspace dimensions p 1 and 

55 estimated by MDL. 

[0053] The analysis of Figure 6c highlights that NSS provides a considerable amount of residual "musical noise". In 
contrast, Figure 6d underlines the high performance of the proposed approach since it extracts the relevant features 
of the speech signal and reduces the noise to a tolerable level. This high performance in particular confirms the effi- 
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ciency and consistency of the MDL-based subspace method. 

[0054] The method according to the present invention provides similar performance with respect to the subspace 
approach of Ephraim et al. or Vetter et al. which uses KLT. However, it has to be pointed out that the computational 
requirements of the method according to the present invention are reduced by an order of magnitude with respect to 
5 the known KLT-based subspace approaches. 

[0055] Furthermore, an important additional feature of the method according to the present invention is that it is 
highly efficient and robust in detecting speech pauses, even in very noisy -conditions. This can be observed in Figure 
6e for the signal subspace dimension is zero during frames without any speech activity. 

[0056] It will be appreciated that the proposed enhancing method may be applied as part of an enhancing scheme 
10 in dual or multiple channel enhancement systems, i.e. systems relying on the presence of multiple microphones. Anal- 
ysis and combination of the signals received by the multiple microphones enables to further improve the performances 
of the system notably by allowing one to exploit spatial information in order to improve reverberation cancellation and 
noise reduction. 

[0057] Figure 7 schematically shows a dual channel speech enhancement system for implementing a speech en- 
15 hancement scheme according to a second embodiment of the present invention. Similarly to the single channel speech 
enhancement system of Figure 2, this dual channel system comprise first and second channels each comprising a 
microphone 10, 10* with associated amplifying means 1 1 , 1 1 ', a filter 12, 12' connected to the microphone 10,10' and 
an analog-to-digital converter (ADC) 14,14' for sampling and converting the received signal of each channel into digital 
form. The digital signals provides by the ADC's 14, 14' are applied to a digital signal processor (DSP) 1 6 programmed 
20 to process the signals according to the second embodiment which will be described hereinbelow. The enhanced signals 
produced at the output of the DSP 16 are again supplied to an end-user system 18. 

[0058] The underlying principle of the dual channel enhancement method is substantially similar to the principle 
which has been described hereinabove. The dual channel speech enhancement method however makes additional 
use of a coherence function which allows one to exploit the spatial diversity of the sound field. In essence, this method 
25 is a merging of the above-described single channel subspace approach and dual channel speech enhancement based 
on spatial coherence of noisy sound field. With respect to this latter aspect, one may refer to R. Le Bourquin "Enhance- 
ment of noisy speech signals: applications to mobile radio communications", Speech Communication (1996), vol. 18, 
pp. 3-19. 

[0059] Referring to expression (1 ) above, a speech signal s(t) uttered by a speaker is submitted to modifications due 
30 to its propagation. Additionally, some noise is added so that the two resulting signals which are available on the mi- 
crophones can be written as: 

*l(0 = + n x (t) 

35 x 2 (t) = s 2 (t) -h n 2 {t) t = 0 9 ... 9 N t -l (15) 

[0060] The present principle is based on the following assumptions : (a1) The microphones are in the direct sound 
field of the signal of interest, (a2) whereas they are in the diffuse sound field of the noise sources. Assumption (a1) 
requires that the distance between speaker of interest and microphones is smaller than the critical distance whereas 

40 (a2) requires that the distance between noise sources and microphones is larger than the critical distance as specified 
in M. Drews, "Mikrofonarrays und mehrkanalige Signalverarbeitung zur Verbesserung gestorter Sprache", PhD thesis, 
Technische Universitat, Berlin (1 999). This is a plausible assumption for a large number of applications. As an example, 
consider a moderately reverberating room with a volume of 125 m 3 and a reverberation time of 0.2 seconds which 
yields a critical distance of r c - 1.4 m. Consequently, assumption (a1) is verified if the speaker is nearer than r c while 
(a2) requires that the noise sources are at a distance larger than r c . The consequence of (a1 ) is that the contributions 
of the signal of interest stft) and s^t) in the recorded signal are highly correlated. In contrast, (a2) together with a 
sufficient distance between microphones implies that the contributions of noise n 1 (t) and n 2 (t) in the recorded signal 
are weakly correlated. Since signal and noise have generally non-uniform distribution in the time-frequency domain, 
it is advantageous to perform a correlation measure with respect to frequency and time. This leads to the concept of 

50 time adaptive coherence function. 

[0061] Figure 8 schematically illustrates the proposed dual channel speech enhancement method according to a 
preferred embodiment of the invention. The steps which are similar to the steps of Figure 4 are indicated by the same 
reference numerals and are not described here again. As illustrated, following the windowing process 200, the time- 
domain components of the noisy signals xrft) and x^i) are transformed in the frequency-domain (step 21 0) using DCT 

55 and thereafter processed using Bark filtering (step 220) as already explained hereinabove with respect to the single 
channel speech enhancement method. Expressions (2) and (3) above are therefore equally applicable to each of the 
DCT components X t (k) and X 2 (k). Prewhitening (step 230) and subspace selection (step 240) based on the MDL 
criterion (expression (4)) is applied as before. 
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[0062] Similarly, reconstruction of the enhanced signal is obtained by applying the inverse DCT to components of 
the signal subspace and weighted components of the signal-plus-noise subspace as defined by expressions <5), (6) 
and (7) above. 

[0063] The non-filtered weighting function in expression (7) is however modified and uses a coherence function C/ 
5 (step 278) as well as the local SNRj (step 275) of each Bark component as follows : 

gj = exp{- vj i{Cj SNRj)\ j = A +1,...,P2 (16) 

10 where the coherence function C y is evaluated in the Bark domain by : 

1 ^Fp^W 



15 



20 



25 



30 



35 



40 



45 



55 



where 



Bark O 8 ) 

with p, q = 1 , 2. The parameter v in expression (1 6) is adjusted through a non-linear probabilistic operator in function 
of the global signal-to-noise ratio SNH as already defined by expressions (9), (10) and (11) above. 
[0064] Highest perceptual performance may as before be obtained by additionally tolerating background noise of a 
given level and use a noise compensation (step 290) defined in expressions (12) and (13) above. 
[0065] Eventually, a final step may consist in an optimal merging of the two enhanced signals. A weighted-delay- 
and-sum procedure as described in S. Haykin, "Adaptive Filter Theory", Prentice Hall (1991), may for instance be 
applied which yields finally the enhanced signal: 

?(0 = ^W + ^(0 (19) 

where w 1 and w 2 are chosen to optimize the posterior SNR. 

[0066] With respect to the performance of the dual channel speech enhancement method of the present invention, 
it has been observed by the authors that the proposed dual channel subspace approach outperforms classical single 
channel algorithms such the single channel approach based on non-causal Wiener Filtering which is described in J. 
R. Delleretal., "Discrete-Time Processing of Speech Signals", Macmillan Publishing Company, New York (1993). Tests 
have pointed out that the inclusion of the coherence function improves the perceptual performance of the single channel 
subspace approach which has been presented above. 

[0067] Having described the invention with regard to certain specific embodiments, it is to be understood that these 
embodiments are not meant as limitations of the invention. Indeed, various modifications and/or adaptations may be- 
come apparent to those skilled in the art without departing from the scope of the annexed claims. For instance, the 
proposed optimization scheme which uses genetic algorithms shall not be considered as restricting the scope of the 
present invention. Indeed, it will be appreciated that any other appropriate optimization scheme may be applied in 
order to optimise the parameters of the proposed speech enhancement method. 
[0068] Furthermore DCT has been applied to obtain components of the dual domain with in order to have maximum 
energy compaction, but Discrete Fourier Transform DFT is equally applicable despite being less optimal than DCT. 

50 Claims 

1. Method for enhancing speech in a noisy environment comprising the steps of: 



a) sampling a input signal comprising additive noise to produce a series of time-domain sampled components; 

b) subdividing said time-domain components in a plurality of overlapping frames each comprising a number 
N of samples; 

c) for each of said frames, applying a transform to said N time-domain components to produce a series of N 
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frequency-domain components X(k), 

d) applying Bark filtering to said frequency-domain components X(K) to produce Bark components {X(k) Bark ), 
said Bark components being given by the following expression: 



b/2 . 
m B ark = £ G0\ k){X(k - j) } 2 k = 0, . . . , N - 1 
j=-b/2 

where b + / is the processing-width of the filter and G(j, k) is the Bark filter whose bandwidth depends on k, 
said Bark components forming a /V-dimensional space of noisy data; 

e) partitioning said Af-dimensional space of noisy data into three different subspaces, namely: 

a first subspace or noise subspace of dimension N-P2 containing essentially noise contributions with sig- 
nal-to-noise ratios SNRj < 1 ; 

a second subspace or signal subspace of dimension p 1 containing components with signal-to-noise ratios 
SNRj » 1 ; and 

a third subspace or signal-plus-noise subspace of dimension p 2 - p^ containing components with SNRj- 
1 ; and 

f) reconstructing an enhanced signal by applying the inverse transform to the components of said signal sub- 
space and weighted components of said signal-plus-noise subspace. 

Method according to claim 1 , wherein steps a) to f) are performed based on a first and a second input signal 
respectively provided by first and second channels, said reconstructing step f) being performed using a coherence 
function (Cfi based on Bark components (X^k)^^, X 2 (k) Bark ) of said first and second input signal. 

Method according to claim 1 or 2, wherein said partitioning step comprises using a Minimum Description Length, 
or MDL, criterion to determine the dimensions p v p 2 of said subspaces, said MDL criterion being given by the 
following expression : 



MDi(p l ) = -ln 



where /= 1 , 2, M=pf4'p?l2 +p/2 + 1 is the number of free parameters, Xyfor y = 0 N- 1 are the Bark components 

rearranged in decreasing order, and 7 is a parameter determining the selectivity of said MDL criterion. 

Method according to claim 3, wherein said dimensions p 1 and p 2 are given by the minimum of said MDL criterion 
with y = 64 and 7 = 1 respectively. 

Method according to any one of the preceding claims, wherein said transform is a Discrete Cosine Transform (DCT). 

Method according to claim 5, wherein said reconstructing step f) comprises applying the Inverse Discrete Cosine 
Transform to components of said signal subspace and weighted components of said signal-plus-noise subspace, 
said enhanced signal being given by the following expression : 



N 



N 



Pi j=] 
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with 



where kjior j= are the Bark components rearranged in decreasing order, / y is the index of rearrangement 

and p; ; is an appropriate weighting function. 

Method according to claim 6, wherein said weighting function g } is given by the following expression : 



K lagb 

S /*) = K a g j(k - 1 ) £ K bi gj& ~ 0 
i=0 

with 

gj = exp{- VjlSNRj) J = Pi + 1,-..,P2 

where SNRj1orj= 0,...,N - 1 is the estimated signal-to-noise ratio of each Bark component and parameter v is 
adjusted through a non-linear probabilistic operator in function of the global signal-to-noise ratio SNR, the param- 
eters K a , K tagb and K b1 to K btagb , being selected to optimize the speech enhancement method. 

Method according to claim 6, steps a) to 0 being performed based on a first and a second input signal respectively 
provided by first and second channels, said reconstructing step f) being performed using a coherence function (Cy) 
based on Bark components (Xf(*)Ba/*> X 2(^Bark) °f said first and second input signal, wherein said weighting 
function g l is given by the following expression : 



K Iagb 

gj (k) = K agj (k - 1) X K bi gy(k - 0 

with 

gj = exp{- vj l(Cj SNRj)\ j = ft + 1,...,/>2 



where said coherence function C y is evaluated in the Bark domain by : 
where 
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and where SNRjtor j = 0,...,A/ - 1 is the estimated signal-to-noise ratio of each Bark component and parameter v 
5 is adjusted through a non-linear probabilistic operator in function of the global signal-to-noise ratio SNR, the pa- 

rameters K a , K tagb and K b1 to K bfagb , being selected to optimize the speech enhancement method. 



9. Method according to claim 7 or 8, wherein said parameter v is adjusted as follows: 



10 



15 



MSNR) ifjZp, 
f 2 (SNR) if J < p 2 
f 3 {SNR) if Pi < j ^'W 



where 



f} = K h + logsigfK^ + SNR) 

and 

SNR = we6\an(SNR(k) : ... t SNR(k - lag K )) 



where SNR(k) is the estimated global logarithmic signal-to-noise ratio and the parameters K tff k 12 , .... K44 are 
selected to optimize the speech enhancement method. 

30 10. Method according to claim 9 } wherein the parameters K iagb , K b1 to K bfagb , and K 11t k 12 , are optimized by 

means of a so-called genetic algorithm. 



11. Method according to claim 9 or 10 } further comprising a noise compensation step of the form : 
where 

v 4 =f 4 (SNR) 



and f 4 is given by the expression defined in claim 9. 

45 12. Method according to claim 8, further comprising a merging of a first enhanced signal reconstructed from compo- 
nents derived from said first channel and of a second enhanced signal reconstructed from components derived 
from said second channel. 



13. System for enhancing speech in a noisy environment comprising : 

50 

means (10, 11, 12; 10', 1V, 12') for detecting an input signal comprising a speech signal and additive noise; 
means (14; 14') for sampling and converting said input signal into a series of time-domain sampled compo- 
nents; and 

digital signal processing means (16) for processing said series of time-domain sampled components and pro- 
55 ducing an enhanced signal substantially representative of the speech signal contained in said input signal, 

characterized in that said digital processing means (16) are programmed to perform the speech enhance- 
ment method according to any of the preceding claims. 
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